01 - An Introduction
A modified Pareto principle succinctly explains statistics - 5% of the subject matter can be used to solve 95% of the real-world use cases. If everyone understood the basics of statistics, society would be more analytical, better able to reason through what data is telling us, and able to design its own experiments. Unfortunately, university courses tend to cover too much information too fast, and the topics covered in more advanced statistics courses drift away from everyday practice.
As a data and geospatial engineer, I delve into statistics all the time - but the problem is that I’m constantly having to look up the same things, over and over. This sequence of articles is designed to focus on the core statistics principles, broken up into digestible segments. Even if you simply learn what a data frame is, the difference between descriptive and inferential statistics, and the difference between quantitative (numerical) and qualitative (categorical) data, you’ve got a great starting point.
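As a rough sketch of those three ideas, here is a minimal Python example that uses only the standard library. The column names and values are made up purely for illustration, and a plain dictionary of columns stands in for a real data frame. The interval at the end uses a normal approximation, which is an assumption made to keep the example short; a t-based interval would be more appropriate for a sample this small.

```python
import statistics
from collections import Counter

# Hypothetical data: a tiny "data frame" stored as a dict of equal-length columns.
# Each position across the columns is one observation (row). All values are made up.
plots = {
    "habitat": ["forest", "meadow", "forest", "wetland", "meadow", "forest"],  # qualitative
    "height_cm": [12.1, 9.8, 13.4, 7.9, 10.2, 12.6],                           # quantitative
}

heights = plots["height_cm"]
n = len(heights)

# Descriptive statistics: summarize the sample we actually collected.
mean_height = statistics.mean(heights)
sd_height = statistics.stdev(heights)          # sample standard deviation
habitat_counts = Counter(plots["habitat"])     # counts are the natural summary for categories

# Inferential statistics: use the sample to estimate something about the population.
# Assumption: normal approximation for a 95% interval around the mean.
standard_error = sd_height / n ** 0.5
z = statistics.NormalDist().inv_cdf(0.975)
ci_low = mean_height - z * standard_error
ci_high = mean_height + z * standard_error

print("mean:", mean_height, "sd:", sd_height)
print("habitat counts:", dict(habitat_counts))
print("approximate 95% interval for the population mean:", (ci_low, ci_high))
```

The first half only describes the six plots in hand; the second half makes a hedged claim about plots that were never measured, which is the step that separates inferential from descriptive statistics.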
I initially wrote about 300 pages of code and narrative - unfortunately, it turned into a slog, with so much detail on the code that the primary points were lost. I still link to a code repository, should you wish to follow along, but that will remain separate. Feel free to use whatever programming language or statistical tool you like - I’ve used Python, R, MATLAB, JMP, Alteryx and Minitab at one time or another, and I’ve realized that they can all get you to the same place.
Michael J. Crawley's Statistics: An Introduction Using R is an excellent resource for learning both statistics and R programming. He expertly weaves his way from the bare basics of stats and programming into intermediate statistical modeling. Combine a resource like this with some of the individual courses at DataCamp (a low-cost subscription to over 200 courses in data science), and you'll have enough command of the subject to make an impact in any discipline.
Crawley has an ecology background, so the data is understandably ecological. However, even if you are more interested in data from the social sciences, business, engineering or anything else, the same topics and techniques are immediately transferable. If you understand Analysis of Variance in the context of glycogen found in rat livers, you can understand Analysis of Variance applied to a marketing campaign.
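To make the transferability point concrete, here is a minimal sketch of a one-way Analysis of Variance F statistic written with only the Python standard library. The function name and both data sets are hypothetical and invented for illustration; they are not taken from Crawley's data. The sketch stops at the F statistic, since turning it into a p-value needs the F distribution, which a statistics package would normally provide.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical ecology-style data: a measurement under three treatments (made up).
glycogen_by_treatment = [[131, 130, 136], [150, 148, 153], [134, 137, 135]]

# Hypothetical business-style data: weekly sales under three ad variants (made up).
sales_by_ad_variant = [[20, 23, 21], [27, 29, 30], [22, 21, 24]]

# The same function handles both; only the meaning of the numbers changes.
f_liver = one_way_anova_f(glycogen_by_treatment)
f_marketing = one_way_anova_f(sales_by_ad_variant)
print("F (treatments):", f_liver)
print("F (ad variants):", f_marketing)
```

Nothing in the calculation knows whether the groups are treatments or ad variants, which is the sense in which the technique carries over between disciplines.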
You can download the data from Crawley's page on the Imperial College London website. However, feel free to use your own data or follow Crawley’s book for more detail. The data is simply the vehicle for conveying the information.